Best Evaluation Benchmark AI Tools & Models - Premium Evaluation Benchmark News

AI News

OpenAI Criticizes AI Evaluation Benchmark: 731 Questions, Nearly a Third Have Flaws. 8-Month Passing Rate Rises from 23% to 80%, Now Ineffective

OpenAI has publicly questioned the SWE-Bench Pro benchmark, pointing out that about 30% of its 731 test tasks have evaluation flaws. The benchmark, launched by Scale AI, is regarded as an industry authority for measuring the programming capabilities of large models. OpenAI also warned that the pass rate of cutting-edge models surged from 23.3% to 80.3% within 8 months, an unusually fast rise that casts doubt on the reliability of the evaluation.

74k 14 minutes ago

New Metrics for Agent Evolution: ByteDance Seed Launches EdgeBench Benchmark

The ByteDance Seed team has introduced EdgeBench, an ultra-long-horizon evaluation dataset that measures agents' ability to learn continuously in real-world environments. The benchmark spans six fields and comprises 134 real-world tasks, each requiring agents to complete multi-step interactions and adapt dynamically to environmental changes, providing a quantitative reference for research on long-term autonomous learning.

16.3k 17 hours ago

AI Audio Editing Enters a New Era: Tencent Hunyuan Collaborates with Leading Institutions to Release the MMAE Benchmark. Current Model Precision in Audio Editing is Less Than 5%

Tencent Hunyuan, in collaboration with Shanghai Jiao Tong University, Nanyang Technological University of Singapore, Tianjin University, Peking University, and Fudan University, has launched the first general instruction-driven audio editing benchmark dataset, MMAE. This benchmark addresses the current limitations in AI's ability to edit audio, filling a gap in the field of audio generation and providing an important evaluation standard for multi-task audio editing research.

78.5k 10 hours ago

Breakthrough in Domestic Embodied Intelligence! Yuejiang Releases the Kongyi Large Model, with Standard Task Success Rate Exceeding 99%

A breakthrough in embodied intelligence: Yuejiang Robot has released its self-developed world action model, 'DobotWAM', advancing the understanding and execution of complex real-world tasks. It passes the LIBERO benchmarks, including the Spatial and Object evaluations, confirming its core capabilities...

15.5k 1 day ago

AI Products

Movie Gen Bench

Video Generation Evaluation Benchmark

Video generation

12k

SuperCLUE

Leading AI evaluation benchmark for measuring and comparing AI model performance.

AI model

12.5k

VQAScore

VQAScore is an evaluation metric and benchmark for text-to-vision generation. Built on the CLIP-FlanT5 model, it achieves state-of-the-art performance in evaluating text-to-image, text-to-video, and text-to-3D generation, and serves as a strong alternative to CLIPScore. The accompanying benchmark dataset, GenAI-Bench, provides real-world test prompts with rich semantic combinations, allowing a comprehensive assessment of generative model performance.

AI image generation

10.9k

Models

| Model | Provider | Input tokens/M | Output tokens/M | Context Length |
| --- | --- | --- | --- | --- |
| Claude 3 Opus | Anthropic | $105 | $525 | 200 |
| qwen-image-edit | Alibaba | - | - | - |
| Doubao-1.5-pro-32k | ByteDance | $0.8 | - | 128 |
| Qwen3-235B-A22B-Instruct-2507 | Alibaba | - | - | - |
| Hunyuan-Large-Vision | Tencent | - | - | - |
| GLM-4.5-X | ChatGLM | - | $16 | 128 |
| Hunyuan-TurboS-latest | Tencent | $0.8 | - | - |
| QianfanHuijin-Reason-8B | Baidu | - | - | - |
| QianfanHuijin-Reason-70B | Baidu | - | - | - |
| Hunyuan-Functioncall | Tencent | - | - | - |
| o3 | OpenAI | $14 | $56 | 200 |
| Hunyuan-Turbo | Tencent | $2.4 | $9.6 | - |
| Hunyuan-TurboS-Longtext-128k-20250325 | Tencent | $1.5 | - | 128 |
| Hunyuan-Standard | Tencent | $0.8 | - | - |
| Baichuan-M2-32B | Baichuan | - | - | - |
| ERNIE X1.1 Preview | Baidu | - | - | - |
| Baichuan2-Turbo | Baichuan | - | - | - |
| Hunyuan-Lite | Tencent | - | - | 250 |
| kimi-latest-128k | Moonshot | $10 | $30 | 131 |
| ERNIE-4.5-VL-424B-A47B-Paddle | Baidu | - | - | - |

MCP

AWorld

AWorld is a multi-agent system framework that aims to bridge the gap between theoretical MAS capabilities and practical applications, providing a full solution from single-agent to multi-agent collaboration and competition. The project supports scenarios such as browser and mobile operations and GAIA benchmark testing, adopts a client-server architecture, integrates a rich toolchain, and includes performance evaluation and training functions.

python

10.6k

2.0 points

Empowering the future, your artificial intelligence solution think tank

Friend Links:

AI Newsletters AI Tools MCP Servers AI News AI Marketing LLM Leaderboard AI Ranking

Business Cooperation Site Map

Models

Qwen3 4B I 1509

GLM 4.5 GGUF

DeepSeek R1 0528 Bf16

YugoGPT Florida_Q8_0 GGUF

Sarashina2.2 1b Instruct V0.1

MMIE Score

M3D LaMed Llama 2 7B

Stella Mrl Large Zh V3.5 1792d

YanoljaNEXT EEVE Instruct 2.8B

Skywork 13B Base

Stella Base Zh V2

Flaubert_large_cased

Cicero Similis
